So there are a lot of non-coding RNAs that are not translated, and their three-dimensional structure and subsequent functions are diverse. The most representative of non-coding RNAs is miRNA. Along with transcription factors (TF), micro RNA is a core transcription regulation molecule. In this practice we use ‘miRExpress’ to learn how to analyze miRNA short read data. First of all, we use known miRNA data as reference sequence for aligning the short reads we acquired from sequencing, and we count the short read numbers to quantitate miRNA expression. This process is repeated over many samples to find significant differentially expressed miRNA for each sample. Determining the function of miRNAs that are regulators of target genes is a rigorous and rather difficult task. We make use of databases that contain various information about miRNA and how they are applied to annotating the function of identified miRNAs.
So we’ve got different kinds of non-coding RNAs: micro-RNA (miRNA), long non-coding RNA (lncRNA), small nuclear RNA (snRNA), small nucleolar RNA (snoRNA), small interfering RNA (siRNA), etc. miRNA usually consists of 22 bases. miRNA binds to 3’-UTR binding site of target mRNA and mediate rapid degradation of that RNA in order to prevent them from being translated into proteins. This is how expression is regulated with miRNAs. miRNAs are associated with various biological functions including cellular growth, death, differentiation, and oncogenesis. (It’s important for regulating cell viability, cell death, and differentiation along with development and oncogenesis as well.)
miRExpress exercise is provided. miRNA data analysis comprises of several steps:
So we perform expression level retrieval in various samples to obtain expression profile of each sample, then detect miRNA for that particular sample that show significantly differential expression from other samples. In the latter portion of this chapter, we practice detecting differentially expressed miRNA using DESeq2. Lastly we study the characteristic features of each database and familiarize ourselves with how to utilize them in proper situations.
Sequencing reads with adapter\(\xrightarrow{\textrm{adaptor trimming}}\) Adaptor sequence may appear in the start or end or middle of reads \(\xrightarrow{\textrm{reads without adaptor}}\)
Each read align with known miRNAs (in miRBase)
Move to terminal and go to (....0 chapter5/miRNA directory.
If you look at the directory there are these files:
[soh1@x020 miRNA]$ ls
3_adaptor.txt hsa_miRNA.txt hsa_trimmed.txt process.txt
expression_profile.txt hsa_precursor.txt miRNAData read_statistics.txt
hsa.fastq hsa_read_count.txt precursor_structure.txt result
| file name | explanation |
|---|---|
| 3_adaptor.txt | sequence info of 3’ adaptor used in the experiment |
| expression_profile.txt | miRNA expression profile of 6 samples (3N/3T) |
| hsa.fastq | miRNA sequencing data for one sample |
| hsa_miRNA.txt | miRNA mature form sequence downloaded from miRBase |
| hsa_precursor.txt | miRNA precursor sequence downloaded from miRBase |
| precursor_structure.txt | file that contains miRNA precursor structure |
| process.txt | file that contains commands for this particular exercise |
If miRNA expression level of a particular miRNA is high, then that sequence is gonna appear many times in the sequencing raw reads. However, processing all the short reads is not practical, therefore counting how many times a particular short read occurs and performing analysis once for each unique short read will result in shortened analysis time. For example, if ‘AACCGGTT’ occurs 10 times you could abbreviate it as ‘10AACCGGTT’, then we could undergo analysis for the sequence ‘AACCGGTT’ just once. Efficient resource management is important in high-throughput (high-volume) data analysis. find repeat number of every sequence.
For this job, I gotta run a binary called Raw_data_parse included in software on Linux called miRExpress, but it’s not installed in /opt/Yonsei/ so I had to install it under my home directory myself.
I downloaded the latest version of miRExpress from the following page.
tar -xvzf miRExpress.2.1.4.tar.gz
./configure --prefix ~/miRExpress
make
make install
Then the necessary binary files are installed under /home/(user)/miRExpress/bin.
Now, use environment variable to assign the binary.
Raw_data_parse=/home/(user)/miRExpress/bin/Raw_data_parse
export Raw_data_parse
Now you can use the binary anywhere in the environment with $Raw_data_parse.
Find out how many times each read occurs in the fastq file (raw read).
$Raw_data_parse -i hsa.fastq -o hsa_read_count.txt
Now if you look at the content of the resulting output file, it shows white-space delimited file with read count at the very beginning and the read sequence to the next.
[soh1@x020 miRNA]$ head hsa_read_count.txt
1 AACTGGATTTTTGGCGCATCGATTCGTATGTCTTCT
1 AGCGGGAGGAAACACCATCGTATGTCGTCTTCTGTT
1 GCGGCGGAGGGGCCGTCGTATGGCGTCTTCTGCTTG
1 GGGGGCTCGCACGATCGTATGCCGTCTCTTGATTGA
1 TCAGAATTCGATGGCGCGTCGTTCTCGTACGCCGTC
1 TAAAAATTTCGGTTGTTGCATCGTATGCCTTCTTCT
1 CGACTCTTAGGGGTGGTTGTCTCGGGTCGTTTGGCG
1 TCTACTGTCCTACCCTCCCTTATCCCGTCTTCTCTT
1 TTCTGGCTGGTTGAAAAATCGTGTGCCCTCTTTTGT
1 ACTGGACTTGTGTTTTTAAGGCTCGTATGTCTTCTT
Remove adaptor sequence in the read count file (hsa_) with Trim_adapter binary executable. Its option -t specifies 3' adaptor sequence and option -h specifies 5' adaptor sequence.
So if we read what our 3’ adaptor looks like,
[soh1@master miRNA]$ cat 3_adaptor.txt
TCGTATGCCGTCTTCTGCTTG
You can see that there’s only one adaptor sequence present in the file. If there were multiple adapter sequences, the file would look like this:
The input sequences file format-
============================================
Counts Sequences
71 GCGGAAATAGCTTAATGGTAGAGCTCGTATGCCGT
4 AGATTAAGCCATGCATGTTCGTATGCCGTCTTCTG
1 TCGAACAAGTAGGTGTAACTGTTCGTATGCCGTCT
1 AGAGAAGATTGGATAGACGGGAAGTAGTATGCCGT
============================================
Counts and Sequences are divided by Tab(\t).
Usage goes like this:
$Trim_adapter -i {read_count_file.txt} -t {3_prime_adaptor_list.txt} -o {output_file.txt}
Trim_adapter=/home/soh1/miRExpress/bin/Trim_adapter
export Trim_adapter
$Trim_adapter -i hsa_read_count.txt -t 3_adaptor.txt -o hsa_trimmed.txt
head hsa_trimmed.txt
The output looks like this:
[soh1@master miRNA]$ head hsa_trimmed.txt
1 TCTGGACTCGGTGTCTTGTCTC
1 AGCGTCTGACTCTTT
1 AGCTACATCTGGCTACTGGGCCT
1 TCGTTCTCCATCCCGCCCTTC
1 TTCTGTGCTTTATTCTGCTTGG
1 AGAGGTCTACATTCGACGATC
1 TATGGGCTTCCATTTTTTATTAT
1 CTGATTGCTCCTTTCTGTTT
1 GATTGCACTTGTCTCGGTCTT
1 CAGCAGCGGGAGAGAAGT
Look at the distribution of short read length after the adaptor sequences removed. It’s evident that there are many short reads of length 22bps, which corresponds to mature miRNA length. Let’s print out the histogram that shows number of unique reads and number of reads including redundant ones.
[soh1@master miRNA]$ statistics_reads=/home/soh1/miRExpress/bin/statistics_reads
[soh1@master miRNA]$ export statistics_reads
[soh1@master miRNA]$ $statistics_reads -i hsa_trimmed.txt -o read_statistics.txt
[soh1@master miRNA]$ cat read_statistics.txt
length number of unique reads number of reads
15 3992 5240
16 3607 4204
17 3465 4113
18 3511 4085
19 3387 3862
20 3387 3725
21 4401 4887
22 13023 14444
23 8549 9655
24 3979 4291
25 2952 3102
26 1440 1496
27 595 605
28 708 710
A paper called “Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells. Genome Res 2008, 18(4):610-621”, which delves into hESC’s miRNA analysis, has a good histogram example of data like above. And it looks like this:
However, I wanted to draw the barplot of the data that I procured. So I referred to this page about “side-by-side bar plot” to draw the bar graph by going through the following procedures.
read_stats <- read.table(file="read_statistics.txt", sep='\t', header=TRUE)
colnames(read_stats) <- c("length", "number_of_unique_reads", "number_of_reads")
read_stats
## length number_of_unique_reads number_of_reads
## 1 15 3992 5240
## 2 16 3607 4204
## 3 17 3465 4113
## 4 18 3511 4085
## 5 19 3387 3862
## 6 20 3387 3725
## 7 21 4401 4887
## 8 22 13023 14444
## 9 23 8549 9655
## 10 24 3979 4291
## 11 25 2952 3102
## 12 26 1440 1496
## 13 27 595 605
## 14 28 708 710
reorganized = data.frame(matrix(NA, nrow(read_stats)*2, ncol(read_stats)))
reorganized
## X1 X2 X3
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## 7 NA NA NA
## 8 NA NA NA
## 9 NA NA NA
## 10 NA NA NA
## 11 NA NA NA
## 12 NA NA NA
## 13 NA NA NA
## 14 NA NA NA
## 15 NA NA NA
## 16 NA NA NA
## 17 NA NA NA
## 18 NA NA NA
## 19 NA NA NA
## 20 NA NA NA
## 21 NA NA NA
## 22 NA NA NA
## 23 NA NA NA
## 24 NA NA NA
## 25 NA NA NA
## 26 NA NA NA
## 27 NA NA NA
## 28 NA NA NA
colnames(reorganized) <- c("length", "read_num", "type")
reorganized$length <- rep(read_stats$length, 2)
len_unique_reads<- length(read_stats$number_of_unique_reads)
len_all_reads <- length(read_stats$number_of_reads)
reorganized$read_num[1:len_unique_reads] <- read_stats$number_of_unique_reads
reorganized$read_num[(len_unique_reads+1):(len_unique_reads+len_all_reads)] <- read_stats$number_of_reads
reorganized$type <- as.factor(c(rep("unique_reads", len_unique_reads), rep("reads", len_all_reads)))
class(reorganized$length)
## [1] "integer"
reorganized
## length read_num type
## 1 15 3992 unique_reads
## 2 16 3607 unique_reads
## 3 17 3465 unique_reads
## 4 18 3511 unique_reads
## 5 19 3387 unique_reads
## 6 20 3387 unique_reads
## 7 21 4401 unique_reads
## 8 22 13023 unique_reads
## 9 23 8549 unique_reads
## 10 24 3979 unique_reads
## 11 25 2952 unique_reads
## 12 26 1440 unique_reads
## 13 27 595 unique_reads
## 14 28 708 unique_reads
## 15 15 5240 reads
## 16 16 4204 reads
## 17 17 4113 reads
## 18 18 4085 reads
## 19 19 3862 reads
## 20 20 3725 reads
## 21 21 4887 reads
## 22 22 14444 reads
## 23 23 9655 reads
## 24 24 4291 reads
## 25 25 3102 reads
## 26 26 1496 reads
## 27 27 605 reads
## 28 28 710 reads
library(ggplot2)
ggplot(data=reorganized, aes(x = length, y=read_num, fill=type)) +
geom_bar(stat="identity", position=position_dodge(), alpha = 0.75) +
geom_text(aes(label=read_num), fontface="bold", vjust=0.9,
position = position_dodge(0.9), size = 3.1) +
theme(axis.text.x= element_text(angle = 90, vjust = 0.5))
If you wanna save this plot into .png format, save it into a object (for example, myplot), use png("myplot.png"), and print(myplot), and dev.off().
myplot <- ggplot(data=reorganized, aes(x = length, y=read_num, fill=type)) +
geom_bar(stat="identity", position=position_dodge(), alpha = 0.75) +
geom_text(aes(label=read_num), fontface="bold", vjust=0.9,
position = position_dodge(0.9), size = 3.1) +
theme(axis.text.x= element_text(angle = 90, vjust = 0.5))
png("read_length_distribution_bar_plot.png")
print(myplot)
dev.off()
## png
## 2
Then it looks like this in the same directory:
It’s printed well.
Q. I still don’t know how to turn on the ticks for each integer.
Make a new directory called result.
[soh1@x020 miRNA]$ mkdir result
[soh1@x020 miRNA]$ ls
3_adaptor.txt hsa_precursor.txt miRExpress.2.1.4.tar.gz read_statistics.txt
expression_profile.txt hsa_read_count.txt miRNAData result
hsa.fastq hsa_trimmed.txt precursor_structure.txt
hsa_miRNA.txt miRExpress process.txt
Read alignment is the way of miRExpress to compare the reads to the previously known miRNA reference sequence. miRBase is a database that provides known miRNA reference sequences. Now that we have removed adaptor sequences from our read file, align it with the reference seqs. Novel miRNA fragments may be discovered through this process if we gather unaligned short reads. We gotta be careful that these novel fragments are not parts of mRNA or tRNA and there must be a step to filter these out. The exercise only performs alignment of miRNA to already known miRNA reference seq. The next job takes about 15 min. for calculation.
“alignmentSIMD” handles the alignment in query sequences and reference sequences.
The usage of alignmentSIMD is as follows:
alignmentSIMD [-r precursor miRNA file] [-i input sequence file] [-o output directory] [-t alignment identity, optional] [-n Rank nohit file] [-u Number of CPU for calculation]
[soh1@x020 miRNA]$ alignmentSIMD=/home/soh1/miRExpress/bin/alignmentSIMD
[soh1@x020 miRNA]$ export alignmentSIMD
$alignmentSIMD -u 4 -r hsa_precursor.txt -i hsa_trimmed.txt -o result/
For now, there are only the file nohit and directory Temp in result.
[soh1@x020 result]$ ls
nohit Temp
Open R.
getwd()
## [1] "C:/Users/me223w6wdp046/Desktop/miRNA"
dir()
## [1] "3_adaptor.txt"
## [2] "cap_miRSeq_pipeline.jpg"
## [3] "cts_txt.png"
## [4] "expression_profile.txt"
## [5] "hsa.fastq"
## [6] "hsa_miRNA.txt"
## [7] "hsa_precursor.txt"
## [8] "hsa_read_count.txt"
## [9] "hsa_trimmed.txt"
## [10] "mirexpress.gif"
## [11] "miRNA_and_non_coding_RNAs.html"
## [12] "miRNA_and_non_coding_RNAs.Rmd"
## [13] "miRNA_and_non_coding_RNAs_files"
## [14] "mirna_read_length_dist.PNG"
## [15] "miRNAData"
## [16] "mirnaSeq.PNG"
## [17] "nontumor_vs_blca_results_csv.PNG"
## [18] "nontumor_vs_blca_results_csv_2.PNG"
## [19] "nontumor_vs_blca_sig_results_csv.PNG"
## [20] "output_flowchart.png"
## [21] "precursor_structure.txt"
## [22] "printed_bar_plot.PNG"
## [23] "process.txt"
## [24] "read_length_distribution_bar_plot.png"
## [25] "read_statistics.txt"
## [26] "result"
Read the read_statistics.txt file and check its contents with read.delim and as.matrix. But you can also use read.table by specifying the separating delimiter as we did before.
stat <- as.matrix(read.delim("read_statistics.txt", header=TRUE, row.names=1))
stat
## number.of.unique.reads number.of.reads
## 15 3992 5240
## 16 3607 4204
## 17 3465 4113
## 18 3511 4085
## 19 3387 3862
## 20 3387 3725
## 21 4401 4887
## 22 13023 14444
## 23 8549 9655
## 24 3979 4291
## 25 2952 3102
## 26 1440 1496
## 27 595 605
## 28 708 710
stat[,2] <- stat[,2] - stat[,1]
colnames(stat) <- c("Unique", "Non-Unique")
stat
## Unique Non-Unique
## 15 3992 1248
## 16 3607 597
## 17 3465 648
## 18 3511 574
## 19 3387 475
## 20 3387 338
## 21 4401 486
## 22 13023 1421
## 23 8549 1106
## 24 3979 312
## 25 2952 150
## 26 1440 56
## 27 595 10
## 28 708 2
I guess I didn’t do it right by myself because I didn’t subtract unique (“Unique”) reads from all reads to get non-unique (“Non-Unique”) reads.
barplot(stat)
We can transpose
stat to get a stacked version of the barplot.
t(stat)
## 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## Unique 3992 3607 3465 3511 3387 3387 4401 13023 8549 3979 2952 1440 595 708
## Non-Unique 1248 597 648 574 475 338 486 1421 1106 312 150 56 10 2
rownames(stat)
## [1] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28"
colnames(stat)
## [1] "Unique" "Non-Unique"
You can intuitively see that this data format of a matrix will result in a stacked form of a barplot.
barplot(t(stat))
barplot(t(stat), main="Read Statistics", xlab="Read Length (bp)", ylab="The Number of Reads", col=c("black", "gray"))
legend("topleft", legend=c("Non-Unique", "Unique"), fill=c("gray", "black"))
Then this changes the story. Let’s get back to the ggplot I drew in section 2-3.
reorganized$type
## [1] unique_reads unique_reads unique_reads unique_reads unique_reads
## [6] unique_reads unique_reads unique_reads unique_reads unique_reads
## [11] unique_reads unique_reads unique_reads unique_reads reads
## [16] reads reads reads reads reads
## [21] reads reads reads reads reads
## [26] reads reads reads
## Levels: reads unique_reads
levels(reorganized$type)
## [1] "reads" "unique_reads"
levels(reorganized$type) <- c("Non-Unique", "Unique")
levels(reorganized$type)
## [1] "Non-Unique" "Unique"
reorganized$type
## [1] Unique Unique Unique Unique Unique Unique
## [7] Unique Unique Unique Unique Unique Unique
## [13] Unique Unique Non-Unique Non-Unique Non-Unique Non-Unique
## [19] Non-Unique Non-Unique Non-Unique Non-Unique Non-Unique Non-Unique
## [25] Non-Unique Non-Unique Non-Unique Non-Unique
## Levels: Non-Unique Unique
for (idx in 1:nrow(reorganized)){
if (reorganized$type[idx]=="Non-Unique"){
reorganized$read_num[idx] <- (reorganized$read_num[idx] - reorganized$read_num[(idx-14)])
}
}
reorganized
## length read_num type
## 1 15 3992 Unique
## 2 16 3607 Unique
## 3 17 3465 Unique
## 4 18 3511 Unique
## 5 19 3387 Unique
## 6 20 3387 Unique
## 7 21 4401 Unique
## 8 22 13023 Unique
## 9 23 8549 Unique
## 10 24 3979 Unique
## 11 25 2952 Unique
## 12 26 1440 Unique
## 13 27 595 Unique
## 14 28 708 Unique
## 15 15 1248 Non-Unique
## 16 16 597 Non-Unique
## 17 17 648 Non-Unique
## 18 18 574 Non-Unique
## 19 19 475 Non-Unique
## 20 20 338 Non-Unique
## 21 21 486 Non-Unique
## 22 22 1421 Non-Unique
## 23 23 1106 Non-Unique
## 24 24 312 Non-Unique
## 25 25 150 Non-Unique
## 26 26 56 Non-Unique
## 27 27 10 Non-Unique
## 28 28 2 Non-Unique
ggplot(data=reorganized, aes(x = length, y=read_num, fill=type)) +
geom_bar(stat="identity", position=position_dodge(), alpha = 0.75) +
geom_text(aes(label=read_num), fontface="bold", vjust=0.9,
position = position_dodge(0.9), size = 3.1) +
theme(axis.text.x= element_text(angle = 90, vjust = 0.5))
There it is.
Now, let’s get back to alignmentSIMD command. If it operated accordingly, you can observe that the acquired short reads are well aligned to the previously known miRNA sequences, that you can see. Check out the results yourself.
[soh1@x020 Temp]$ more miRNA900
AGCAGCATTGTACAGGGCTATGAAGAGGGCTATGA
cd ../..
“analysis” handles the result of alignment and constructs miRNA expression profiles.
Usage of analysis executable of miRExpress is as follows:
analysis [-r precursor miRNA file] [-m mature miRNA information in precursor sequences][-d alignment result directory] [-o output file name of the alignment between precursor miRNA and reads] [-t output file for miRNA expression]
Notice: Results will be stored in “alignment result directory (folder of parameter -d)”
[soh1@x020 miRNA]$ analysis=/home/soh1/miRExpress/bin/analysis
[soh1@x020 miRNA]$ export analysis
[soh1@x020 miRNA]$ $analysis -r hsa_precursor.txt -m hsa_miRNA.txt -d result/ -o read_align.txt -t expression.txt
* Note: hsa stands for homo sapiens.
You can see that expression level file and a file that shows alignment between precursor miRNA and reads are now present in /result.
[soh1@x020 result]$ ls
expression.txt nohit read_align.txt Temp
If you look at what expression.txt contains, looks like this:
[soh1@x020 result]$ head expression.txt
hsa-mir-28,hsa-miR-28-5p 2
hsa-mir-28,hsa-miR-28-3p 3
hsa-mir-1253,hsa-miR-1253 1
hsa-mir-200c,hsa-miR-200c-3p 6
hsa-mir-518c,hsa-miR-518c-3p 4
hsa-mir-148b,hsa-miR-148b-5p 1
hsa-mir-148b,hsa-miR-148b-3p 1
hsa-mir-515-1,hsa-miR-515-5p 1
hsa-mir-330,hsa-miR-330-3p 4
hsa-mir-143,hsa-miR-143-3p 5
You can also check out miRNA structural information from the aligned sequence info (read_align.txt).
[soh1@x020 result]$ head read_align.txt
hsa-mir-28
GGUCCUUGCCCUCAAGGAGCUCACAGUCUAUUGAGUUACCUUUCUGACUUUCCCACUAGAUUGUGAGCUCCUGGAGGGCAGGCACU
AAGGAGCUCACAGUCUAUUGAG hsa-miR-28-5p
AAGGAGCTCACAGTCTATTGA 1
CAAGGAGCTCACAGTCTATTGAG 1
CACUAGAUUGUGAGCUCCUGGA hsa-miR-28-3p
ACTAGATTGTGAGCTCCTGGAG 1
CACTAGATTGTGAGCTCCTGGAG 1
CACTAGATTGTGAGCTCCTGGA 1
//
You can set various options like mismatch number when aligning. Check out the options.
[soh1@x020 result]$ $analysis
Usage : analysis [-r precursor miRNA file] [-m mature miRNA information in precursor sequences] [-d alignment result directory] [-o output file name (expression)] [-t output file for comparing result,optional]
-r precursor miRNA file (This file must be the same as the file used to do alignmentSIMD)
-m mature miRNA information in precursor sequences (The format must be the same with miRExpress/data/hsa_miRNA.txt)
-d alignment result directory (This directory need to be the same with the directory used to do alignmentSIMD)
-o output file for alignment result
-t output file for expression result
-g tolerance range for mapping mature miRNA [optional, default:4]
-l similarity between read length and mature miRNA [optional, default:0.8]
-s precursor structure file [optional]
-g option sets the tolerance range for mapping mature miRNA. This means the tolerance range to allow in length when aligning an miRNA sequence to the reference. If the read candidate for alignment falls out of this range, it’s not considered as aligned.
-l option sets the sequence similarity lower limit between short read with variants and the mature form miRNA. The default value is set at 0.8.
The result of alignment and expression level analysis over numerous samples can be merged into one file that contains multiple samples’ expression profile. So what we see above in expression.txt is expanded into multiple rows with multiple expression data for multiple samples. Open expression_profile.txt and study its contents. In this practice, we analyze 6 people (3 normal (Normal_#), 3 lung cancer (Cancer_#))’s samples for 500 different miRNAs. The expression levels are arbitrarily assigned for the specific purpose of conducting this exercise.
exp <- as.matrix(read.table("expression_profile.txt", header=TRUE, row.names=1))
Read expression_profile.txt as a table form (data.frame) and then turn it into a matrix. header=TRUE to include header and row.names=1 to designate the first column’s values as row names in the newly generated data.frame.
dim(exp)
## [1] 500 6
head(exp)
## Normal_1 Normal_2 Normal_3 Cancer_1 Cancer_2
## hsa-mir-28,hsa-miR-28-5p 24 20 16 17 12
## hsa-mir-148b,hsa-miR-148b-3p 4 5 7 10 11
## hsa-mir-486,hsa-miR-486-5p 218 256 180 315 201
## hsa-mir-4284,hsa-miR-4284 0 0 0 0 0
## hsa-mir-4492,hsa-miR-4492 1 0 4 0 0
## hsa-mir-191,hsa-miR-191-5p 134 140 100 120 150
## Cancer_3
## hsa-mir-28,hsa-miR-28-5p 15
## hsa-mir-148b,hsa-miR-148b-3p 19
## hsa-mir-486,hsa-miR-486-5p 223
## hsa-mir-4284,hsa-miR-4284 0
## hsa-mir-4492,hsa-miR-4492 0
## hsa-mir-191,hsa-miR-191-5p 167
Data like this probably have to be normalized according to sample, experimental groups, experiment date, and experiment methods for analysis. Data normalization escapes the scope of this exercise, therefore just assume that this data has been preprocessed in terms of normalization and proceed.
One can use t-test to see if there’s a statistically significant difference between the two groups (in this case, normal and lung cancer groups). Stat test performed with significance level of 0.1. Since there exist miRNAs with 0 expression over all 6 samples, which are recorded as NA. Exclude these miRNAs and select for those with less than 0.1 significance level. In the following analysis, 17 out of 500 miRNAs showed differential expression with significance level of 0.1 and the list is available for viewing. Analyze the expression pattern by drawing a heatmap of miRNA expression levels.
p_values <- apply(exp, 1, function(x) as.numeric(t.test(x[1:3], x[4:6], var.equal=FALSE)$p.value))
This line of code can be interpreted as the following sentence: "apply the function(x), which is as.numeric(t.test(x[1:3], x[4:6], var.equal=FALSE)$p.value) to the expression profile matrix exp that has three samples from normal people and and three samples from lung cancer patients. With the argument that goes into var.equal=FALSE, the variance between two groups are considered NOT equal. After t-test is done Then only extract the p.value values from the resulting list.
class(apply(exp, 1, function(x) t.test(exp[1:3], x[4:6], var.equal=FALSE)))
## [1] "list"
It’s a list! R list is the object that contains elements of different types. Therefore it can store p.value, which is actually a character vector that needs to be converted to numeric vector.
If you compare the structure between exp and p_values, you can see that it’s tested whether there is a significant difference between normal sample group mean and cancer sample group mean.
head(exp)
## Normal_1 Normal_2 Normal_3 Cancer_1 Cancer_2
## hsa-mir-28,hsa-miR-28-5p 24 20 16 17 12
## hsa-mir-148b,hsa-miR-148b-3p 4 5 7 10 11
## hsa-mir-486,hsa-miR-486-5p 218 256 180 315 201
## hsa-mir-4284,hsa-miR-4284 0 0 0 0 0
## hsa-mir-4492,hsa-miR-4492 1 0 4 0 0
## hsa-mir-191,hsa-miR-191-5p 134 140 100 120 150
## Cancer_3
## hsa-mir-28,hsa-miR-28-5p 15
## hsa-mir-148b,hsa-miR-148b-3p 19
## hsa-mir-486,hsa-miR-486-5p 223
## hsa-mir-4284,hsa-miR-4284 0
## hsa-mir-4492,hsa-miR-4492 0
## hsa-mir-191,hsa-miR-191-5p 167
head(p_values)
## hsa-mir-28,hsa-miR-28-5p hsa-mir-148b,hsa-miR-148b-3p
## 0.13547995 0.09593935
## hsa-mir-486,hsa-miR-486-5p hsa-mir-4284,hsa-miR-4284
## 0.53639414 NaN
## hsa-mir-4492,hsa-miR-4492 hsa-mir-191,hsa-miR-191-5p
## 0.29985996 0.32129806
Take only the index of miRNAs that have p_values less than 0.1 and not NA.
index <- which(p_values < 0.1 & !is.na(p_values))
index
## hsa-mir-148b,hsa-miR-148b-3p hsa-mir-29a,hsa-miR-29a-3p
## 2 11
## hsa-mir-92a-1,hsa-miR-92a-3p hsa-mir-200b,hsa-miR-200b-3p
## 15 22
## hsa-mir-30c-2,hsa-miR-30c-5p hsa-mir-31,hsa-miR-31-5p
## 24 56
## hsa-mir-151a,hsa-miR-151a-3p hsa-mir-24-2,hsa-miR-24-3p
## 60 71
## hsa-mir-126,hsa-miR-126-5p hsa-mir-17,hsa-miR-17-5p
## 72 74
## hsa-mir-27b,hsa-miR-27b-3p hsa-mir-186,hsa-miR-186-5p
## 77 80
## hsa-mir-34a,hsa-miR-34a-5p hsa-mir-335,hsa-miR-335-5p
## 107 112
## hsa-mir-26b,hsa-miR-26b-5p hsa-mir-22,hsa-miR-22-3p
## 114 128
## hsa-mir-148a,hsa-miR-148a-5p
## 172
length(index)
## [1] 17
There are total 17 of miRNAs out of 500 that are significantly exhibit differential expression.
rownames(exp)[index]
## [1] "hsa-mir-148b,hsa-miR-148b-3p" "hsa-mir-29a,hsa-miR-29a-3p"
## [3] "hsa-mir-92a-1,hsa-miR-92a-3p" "hsa-mir-200b,hsa-miR-200b-3p"
## [5] "hsa-mir-30c-2,hsa-miR-30c-5p" "hsa-mir-31,hsa-miR-31-5p"
## [7] "hsa-mir-151a,hsa-miR-151a-3p" "hsa-mir-24-2,hsa-miR-24-3p"
## [9] "hsa-mir-126,hsa-miR-126-5p" "hsa-mir-17,hsa-miR-17-5p"
## [11] "hsa-mir-27b,hsa-miR-27b-3p" "hsa-mir-186,hsa-miR-186-5p"
## [13] "hsa-mir-34a,hsa-miR-34a-5p" "hsa-mir-335,hsa-miR-335-5p"
## [15] "hsa-mir-26b,hsa-miR-26b-5p" "hsa-mir-22,hsa-miR-22-3p"
## [17] "hsa-mir-148a,hsa-miR-148a-5p"
heatmap(exp[index,], margins=c(7, 14), col=heat.colors(12, rev=FALSE))
You could also do it this way, without decarling rev=FALSE:
heatmap(exp[index,], margins=c(7, 14), col=heat.colors(12))
Now we use DESeq2, an R package, for comparative analysis of differentially expressed miRNA expression level b/w normal cells and cancer cells. DESeq2 first performs normlization of miRNA-Seq data and then detects differentially expressed miRNA between the two groups. Analysis pipeline looks something like below:
The flowchart is generated by DiagrammeR. Refer to this page that shows how to build a simple flowchart with DiagrammeR.
library(DiagrammeR)
library(DiagrammeRsvg)
When I tried to install DiagrammeRsvg in Linux, error message shows up.
* installing *source* package ‘V8’ ...
** 패키지 ‘V8’는 성공적으로 압축해제되었고, MD5 sums 이 확인되었습니다
** using staged installation
Using PKG_CFLAGS=-I/usr/include/v8 -I/usr/include/v8-3.14
Using PKG_LIBS=-lv8 -lv8_libplatform
-----------------------------[ ANTICONF ]-------------------------------
Configuration failed to find the libv8 engine library. Try installing:
* deb: libv8-dev or libnode-dev (Debian / Ubuntu)
* rpm: v8-devel (Fedora, EPEL)
* brew: v8 (OSX)
* csw: libv8_dev (Solaris)
To use a custom libv8, set INCLUDE_DIR and LIB_DIR manually via:
R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
So I installed libv8-dev in the system, because V8 is required to run this package. It’s a JavaScript engine.
sudo apt install libv8-dev
grViz("digraph flowchart {
# node definitions with substituted label text
node [fontname = Helvetica, shape = rectangle]
tab1 [label = '@@1']
tab2 [label = '@@2']
tab3 [label = '@@3']
tab4 [label = '@@4']
tab5 [label = '@@5']
# edge definitions with the node IDs
tab1 -> tab2 -> tab3 -> tab4 -> tab5;
}
[1]: 'Load DESeq2 input data'
[2]: 'Make DESeq2 data variable'
[3]: 'Differential miRNA expression analysis'
[4]: 'Shrink log2 fold change variance'
[5]: 'Extract differentially expressed miRNAs'
")
output_flowchart <- grViz("digraph flowchart {
# node definitions with substituted label text
node [fontname = Helvetica, shape = rectangle]
tab1 [label = '@@1']
tab2 [label = '@@2']
tab3 [label = '@@3']
tab4 [label = '@@4']
tab5 [label = '@@5']
# edge definitions with the node IDs
tab1 -> tab2 -> tab3 -> tab4 -> tab5;
}
[1]: 'Load DESeq2 input data'
[2]: 'Make DESeq2 data variable'
[3]: 'Differential miRNA expression analysis'
[4]: 'Shrink log2 fold change variance'
[5]: 'Extract differentially expressed miRNAs'
")
library(rsvg)
Error message:
* installing *source* package ‘rsvg’ ...
** 패키지 ‘rsvg’는 성공적으로 압축해제되었고, MD5 sums 이 확인되었습니다
** using staged installation
Package librsvg-2.0 was not found in the pkg-config search path.
Perhaps you should add the directory containing `librsvg-2.0.pc'
to the PKG_CONFIG_PATH environment variable
No package 'librsvg-2.0' found
Using PKG_CFLAGS=
Using PKG_LIBS=-lrsvg
--------------------------- [ANTICONF] --------------------------------
Configuration failed to find the librsvg-2.0 library. Try installing:
* deb: librsvg2-dev (Debian, Ubuntu, etc)
* rpm: librsvg2-devel (Fedora, EPEL)
* csw: librsvg_dev, sunx11_devel (Solaris)
* brew: librsvg (OSX)
If librsvg-2.0 is already installed, check that 'pkg-config' is in your
PATH and PKG_CONFIG_PATH contains a librsvg-2.0.pc file. If pkg-config
is unavailable you can set INCLUDE_DIR and LIB_DIR manually via:
R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
-------------------------- [ERROR MESSAGE] ---------------------------
<stdin>:1:10: fatal error: librsvg/rsvg.h: 그런 파일이나 디렉터리가 없습니다
compilation terminated.
--------------------------------------------------------------------
flowchart_graph <- "digraph flowchart {
# node definitions with substituted label text
node [fontname = Helvetica, shape = rectangle]
tab1 [label = '@@1']
tab2 [label = '@@2']
tab3 [label = '@@3']
tab4 [label = '@@4']
tab5 [label = '@@5']
# edge definitions with the node IDs
tab1 -> tab2 -> tab3 -> tab4 -> tab5;
}
[1]: 'Load DESeq2 input data'
[2]: 'Make DESeq2 data variable'
[3]: 'Differential miRNA expression analysis'
[4]: 'Shrink log2 fold change variance'
[5]: 'Extract differentially expressed miRNAs'
"
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
grViz(flowchart_graph) %>% export_svg %>% charToRaw %>% rsvg_png("output_flowchart.png")
Or, you could achieve this with simple three line code (it’s not currently working, there must be some kind of bug to be fixed):
library(DESeq2)
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:dplyr':
##
## combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which, which.max, which.min
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:dplyr':
##
## first, rename
## The following object is masked from 'package:base':
##
## expand.grid
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
## The following objects are masked from 'package:dplyr':
##
## collapse, desc, slice
## The following object is masked from 'package:grDevices':
##
## windows
## Loading required package: GenomicRanges
## Loading required package: GenomeInfoDb
## Loading required package: SummarizedExperiment
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: DelayedArray
## Loading required package: matrixStats
##
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
##
## anyMissing, rowMedians
## The following object is masked from 'package:dplyr':
##
## count
## Loading required package: BiocParallel
##
## Attaching package: 'DelayedArray'
## The following objects are masked from 'package:matrixStats':
##
## colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
## The following objects are masked from 'package:base':
##
## aperm, apply, rowsum
The exercise data for DESeq2 DEG of miRNA-Seq short reads consists of 10 bladder cancer samples and 10 normal control samples. cts.txt file contains 1881 miRNA’s ID and corresponding expression level for all 20 samples. Columns B1~B10 represent bladder cancer samples, and N1~N10 represent normal control samples. Value in every row stands for each miRNA’s expression level expressed in short read count. The additional file coldata.txt classfies the histology of each sample as either bladder_cancer or non_tumor.
Let’s call cts.txt (containing miRNA expression level data per miRNA IDs in row and sample IDs in column) and coldata.txt (containing histology data that classifies each sample with tissu type; whether it is bladder_cancer or non_tumor type) into the workspace with the function DESeqDataSetFromMatrix and covert them into a data frame format required for DESeq2. Observe its structures.
Change directory for subsequent code chunks with knitr::opts_knit. I referred to this stackoverflow article, “In RStudio/RMarkdown, how to setwd?”, to solve this problem.
knitr::opts_knit$set(root.dir="./miRNAData/")
getwd()
## [1] "C:/Users/me223w6wdp046/Desktop/miRNA/miRNAData"
cts <- as.matrix(read.table(file="cts.txt", sep="\t", header=TRUE, row.names=1))
head(cts)
## B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 N1
## hsa-let-7a-1 39230 22089 12788 122225 45317 55839 29341 23347 48030 22903 8503
## hsa-let-7a-2 38937 22178 12770 121733 45218 55893 29572 23569 47822 22627 8286
## hsa-let-7a-3 39220 22494 12990 122028 45472 56051 29808 23258 48726 22793 8297
## hsa-let-7b 87195 28588 6749 121118 38997 50384 38167 27858 86704 23018 12683
## hsa-let-7c 1261 2449 1577 4792 236 5690 3538 3310 12777 3221 4262
## hsa-let-7d 2221 2135 875 2195 2168 1632 2146 1742 1785 1987 303
## N2 N3 N4 N5 N6 N7 N8 N9 N10
## hsa-let-7a-1 57605 32915 21624 21645 8186 25562 39101 86514 29017
## hsa-let-7a-2 57248 32797 21467 21638 8336 26018 38957 86180 28944
## hsa-let-7a-3 57404 32602 21656 21335 8462 25961 39682 86300 28836
## hsa-let-7b 75220 26676 23633 21871 9599 29301 59486 118660 35267
## hsa-let-7c 34630 17675 12156 7916 3605 14276 11771 20147 18441
## hsa-let-7d 4036 2015 1855 2266 577 974 1937 4443 1426
coldata<-as.matrix(read.table(file="coldata.txt", sep="\t", header=TRUE, row.names=1))
coldata
## tissuetype
## B1 "bladder_cancer"
## B2 "bladder_cancer"
## B3 "bladder_cancer"
## B4 "bladder_cancer"
## B5 "bladder_cancer"
## B6 "bladder_cancer"
## B7 "bladder_cancer"
## B8 "bladder_cancer"
## B9 "bladder_cancer"
## B10 "bladder_cancer"
## N1 "non_tumor"
## N2 "non_tumor"
## N3 "non_tumor"
## N4 "non_tumor"
## N5 "non_tumor"
## N6 "non_tumor"
## N7 "non_tumor"
## N8 "non_tumor"
## N9 "non_tumor"
## N10 "non_tumor"
Q. So why do we use matrix instead of data frame here?1
If the question is answered, go back to our DESeq2 analysis.
dds <- DESeqDataSetFromMatrix(countData = cts, colData=coldata, design=~tissuetype)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
So we give DESeq2 the expression level data of each sample and each miRNA through the matrix cts (matrix input) via argument countData=cts, and we give the classification data of different tissue types by the argument colData=coldata. We specify upon which grounds we will analyze the data against by giving the function a formula: design=~tissutype. This is the variable that will separate the groups and directs how the counts for each gene depends on the variables in colData. Many R formula are valid including designs with multiple variables, e.g., ~group + condition, and designs with interactions, e.g., ~genotype + treatment + genotype:treatment. Default setting is using the last variable in the formula for building results tables and plotting. Note that ~1 can be used for no design.
Summary:
countData=cts: input miRNA expression (miRNA read count) filecolData=coldata: input coldata file, dividing samples into tissue groupsdesign=~tissuetype how the gene counts depend on the variables in colData, in the form of an R formula.dds
## class: DESeqDataSet
## dim: 1881 20
## metadata(1): version
## assays(1): counts
## rownames(1881): hsa-let-7a-1 hsa-let-7a-2 ... hsa-mir-99a hsa-mir-99b
## rowData names(0):
## colnames(20): B1 B2 ... N9 N10
## colData names(1): tissuetype
It’s a DESeqDataSet class of dimension 1881 (miRNA) x20 (samples). If you delve into its intricacies…
head(rownames(dds))
## [1] "hsa-let-7a-1" "hsa-let-7a-2" "hsa-let-7a-3" "hsa-let-7b" "hsa-let-7c"
## [6] "hsa-let-7d"
colnames(dds)
## [1] "B1" "B2" "B3" "B4" "B5" "B6" "B7" "B8" "B9" "B10" "N1" "N2"
## [13] "N3" "N4" "N5" "N6" "N7" "N8" "N9" "N10"
colData(dds)
## DataFrame with 20 rows and 1 column
## tissuetype
## <factor>
## B1 bladder_cancer
## B2 bladder_cancer
## B3 bladder_cancer
## B4 bladder_cancer
## B5 bladder_cancer
## ... ...
## N6 non_tumor
## N7 non_tumor
## N8 non_tumor
## N9 non_tumor
## N10 non_tumor
Now, analyze DE miRNA b/w bladder cancer (bladder_cancer) and normal control (non_tumor) with function DESeq().
# Tests for miRNA expression difference between the tumor/non-tumor groups
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 62 genes
## -- DESeq argument 'minReplicatesForReplace' = 7
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
# Save the differential expression (statistical) test results
res <- results(dds)
res
## log2 fold change (MLE): tissuetype non tumor vs bladder cancer
## Wald test p-value: tissuetype non tumor vs bladder cancer
## DataFrame with 1881 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## hsa-let-7a-1 31973.6879040257 0.596092398748664 0.214009477410276
## hsa-let-7a-2 31909.9797738312 0.595798376225158 0.212623392460365
## hsa-let-7a-3 32011.0843871558 0.587472763761105 0.210779620963768
## hsa-let-7b 38246.2959338692 0.579513078997914 0.272309200729157
## hsa-let-7c 10473.457853198 2.94159174796001 0.406534080719744
## ... ... ... ...
## hsa-mir-9500 0 NA NA
## hsa-mir-96 55.8215949587397 -3.70313512251679 0.546461392569371
## hsa-mir-98 241.165086519578 0.40707093561149 0.210082611831161
## hsa-mir-99a 1929.47516032742 2.20781867728465 0.424780313183046
## hsa-mir-99b 102041.439887595 0.766822992315878 0.340993781195923
## stat pvalue padj
## <numeric> <numeric> <numeric>
## hsa-let-7a-1 2.78535514390281 0.00534691245248623 0.0149951189223058
## hsa-let-7a-2 2.80212995066486 0.00507664199178354 0.0144295544901595
## hsa-let-7a-3 2.78714213961932 0.00531751494420203 0.0149792496865691
## hsa-let-7b 2.12814358621069 0.033325180472559 0.0705643922086736
## hsa-let-7c 7.2357814202246 4.62856277165851e-13 1.00711141686777e-11
## ... ... ... ...
## hsa-mir-9500 NA NA NA
## hsa-mir-96 -6.77657227550012 1.23060512922053e-11 2.09868063929231e-10
## hsa-mir-98 1.93767076705351 0.0526634026170181 0.103522140346849
## hsa-mir-99a 5.19755414449553 2.01927735020974e-07 2.03098913750403e-06
## hsa-mir-99b 2.24878878912835 0.0245259369761443 0.0554690545947923
So the test results contain mean expression level (baseMean), log2 fold change (log2FoldChange), standard error (lfcSE), and FDR (false discovery rate) adjusted p-value. Refer to this article (Why, When and How to Adjust Your P Values?) if you’re curious about why there is the adjusted p-value for multiple hypothesis testing; mainly because of the probability of observing a false positive is inflated when testing over multiple hypotheses.
Also, this (Introduction to DGE) is a good cover on how these figures (numbers) are generated through DE analysis and which mathematical reasoning was used behind the scenes.
miRNA log2 fold change variance shrink method between the groups is recommended because if there is already a small number of miRNA short read present, fold change changes drastically even with a small difference. This results in increased false-positives (false discovery rates). Therefore, “log2 fold change variance shrinkage” is a statistical method to undermine such instability. We sue log2 fold change variance shrinkage in this exercise as well in order to get significantly increased or decreased miRNAs exhibited in bladder cancer tissues.
In the next code chunk, we’re gonna generate an MA plot that prints log2 fold change as the vertical axis, before and after log2 fold change variance shrinking is applied.
resultsNames(dds)
## [1] "Intercept"
## [2] "tissuetype_non_tumor_vs_bladder_cancer"
lfcShrink: Adds shrunken log2 fold changes (LFC) and SE to a results table from DESeq run without LFC shrinkage. Since this step requires the package apeglm, I install it. apeglm stands for “Approximate Posterior Estimation for GLM”.
resLFC <- lfcShrink(dds, coef="tissuetype_non_tumor_vs_bladder_cancer", type="apeglm")
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
## Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
## sequence count data: removing the noise and preserving large differences.
## Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
# shrinks log 2 fold change size to perform the miRNA differential expression evaluation
# Check the results of log2 fold change shrinkage.
resLFC
## log2 fold change (MAP): tissuetype non tumor vs bladder cancer
## Wald test p-value: tissuetype non tumor vs bladder cancer
## DataFrame with 1881 rows and 5 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## hsa-let-7a-1 31973.6879040257 0.586236278842151 0.21086006909514
## hsa-let-7a-2 31909.9797738312 0.584851865658086 0.209553038044322
## hsa-let-7a-3 32011.0843871558 0.574992834702292 0.207782737436832
## hsa-let-7b 38246.2959338692 0.554839740530518 0.266085060263389
## hsa-let-7c 10473.457853198 2.84466970832791 0.410533180906413
## ... ... ... ...
## hsa-mir-9500 0 NA NA
## hsa-mir-96 55.8215949587397 -3.56400379762341 0.544784903466227
## hsa-mir-98 241.165086519578 0.392606295515246 0.206784295947913
## hsa-mir-99a 1929.47516032742 2.08871463379366 0.42927951800603
## hsa-mir-99b 102041.439887595 0.675916100128576 0.329888304225142
## pvalue padj
## <numeric> <numeric>
## hsa-let-7a-1 0.00534691245248623 0.0149951189223058
## hsa-let-7a-2 0.00507664199178354 0.0144295544901595
## hsa-let-7a-3 0.00531751494420203 0.0149792496865691
## hsa-let-7b 0.033325180472559 0.0705643922086736
## hsa-let-7c 4.62856277165851e-13 1.00711141686777e-11
## ... ... ...
## hsa-mir-9500 NA NA
## hsa-mir-96 1.23060512922053e-11 2.09868063929231e-10
## hsa-mir-98 0.0526634026170181 0.103522140346849
## hsa-mir-99a 2.01927735020974e-07 2.03098913750403e-06
## hsa-mir-99b 0.0245259369761443 0.0554690545947923
Now, draw MA-plot off of it.
par(mfrow=c(1,2)) # Set the layout of the plot as 1 row by 2 columns.
plotMA(res, main="raw read counts") # raw miRNA read count's MA plot
plotMA(resLFC, main="log2 fold change shrinked read counts")
# Variance shrinked log2 fold change of read counts as MA plot
Next code is arranging the miRNAs with p-value in the order of least to largest and then getting how many miRNAs there are that meet the criteria of less than 0.1 of FDR-adjusted p-values (that significant level). The resulting info are stored in resSig variable and saved as a table format file with write.csv() function.
resOrdered <- resLFC[order(resLFC$pvalue),]
summary(resLFC)
##
## out of 1189 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 122, 10%
## LFC < 0 (down) : 192, 16%
## outliers [1] : 0, 0%
## low counts [2] : 558, 47%
## (mean count < 1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
Sum all the cases (TRUE) that have adjusted p-value that is less than 0.1. Remove those with NA value. Most R statistic functions have na.rm=TRUE option so that you can exclude missing (NA) values.
sum (resLFC$padj < 0.1, na.rm=TRUE)
## [1] 314
There are total of 314 significantly differentially expressed miRNAs.
Save the results (resOrdered) as a .csv file with the function write.csv.
write.csv(as.data.frame(resOrdered), file="nontumor_vs_blca_results.csv")
It looks like this.
…
You can see that it includes ones that have p-value that are higher than significance level.
Select and subset only the significant ones in the differential expression results (resOrdered).
resSig <- subset(resOrdered, padj < 0.1)
head(resSig)
## log2 fold change (MAP): tissuetype non tumor vs bladder cancer
## Wald test p-value: tissuetype non tumor vs bladder cancer
## DataFrame with 6 rows and 5 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## hsa-mir-17 1948.44642001348 -1.81732606702651 0.17146475578801
## hsa-mir-20a 790.301018506143 -2.38200737517424 0.230515534514516
## hsa-mir-93 15197.1606085431 -2.17821936099175 0.218887101836711
## hsa-mir-19a 75.4976556200038 -2.59230052034485 0.277296171369147
## hsa-mir-518b 121.736612289972 -0.476698677024373 1.1987657611053
## hsa-mir-143 5871618.66029456 5.64594506629235 0.55124893023589
## pvalue padj
## <numeric> <numeric>
## hsa-mir-17 8.56387432933486e-27 5.40380470181029e-24
## hsa-mir-20a 9.46758073817448e-26 2.98702172289405e-23
## hsa-mir-93 4.2808043625091e-24 9.00395850914414e-22
## hsa-mir-19a 2.51049208436663e-21 3.69832010863287e-19
## hsa-mir-518b 2.93052306547771e-21 3.69832010863287e-19
## hsa-mir-143 1.74742061482019e-19 1.65120250564666e-17
Write the significantly DE miRNAs to a .csv file.
write.csv(as.data.frame(resSig), file="nontumor_vs_blca_sig_results.csv")
Now, note that if you get an error message Error in file(file, ifelse(append, "a", "w")) : 커넥션을 열 수 없습니다, it’s usually about permission issue or the file is currently on the memory, being used by other applications like excel.
Unlike the one above, this one contains only really differentially expressed miRNAs that have very low p-values.
Alright, now let’s take a look at what differentially expressed miRNAs pertain to. Top six significantly differentially expressed miRNAs are hsa-mir-17, hsa-mir-20a, hsa-mir-93, hsa-mir-518b, hsa-mir-143.
hsa-mir-17 has something to do with stem cell differentiation and cancer biology. It’s downregulated.
miRBase is a database that has assorted information like sequence data and annotation data of miRNAs. If you search for hsa-mir-221 in the searchbox, you can get mature form miRNA sequence and read count. Also, you can download the experimental data yourself and annotation data as well. It provides many more data on species other than homo sapiens.
So, TarBase provides miRNA infos plus target gene infos. Target genes are important because they are the target on which miRNAs act on and regulate the gene expression and functions. It’s got data on about 65,000 target genes that are experimentally verified. So if you’re exploring the miRNA and target gene interactions, TarBase may prove useful. Although not experimentally validated, it also provides TargetScan that estimates and conjectures a probable site of regulation (target gene) through bioinfo methods.
miRNA-Seq data analysis pipeline based on Java. It can be used to work on from FASTQ raw data to clipping, mapping, and differentially expressed miRNA detection. it’s a well-established one-in-all pipeline and you can get your results in MS Excel spreadsheet file, which is familiar to MS Windows users.
Based on Perl. Tool that finds miRNAs that are already known or novel. This tools is able to detect miRNAs with high accuracy in 7 species.
It’s got a variety of analytical tools to comprise of a miRNA analytical pipeline. It can process files at the level of raw reads (FASTQ, FASTAQ.gz) and perform subsequent steps of adapter cutting (trimming), alignment, known/novel miRNA identification, DE miRNA analysis. One-in-all pipeline.
The paper is here.
Refer to this question thread in stackoverflow. It says that you use data frames if columns (variables) can be expected to be of different types (numeric/character/logical etc.). Matrices are for data of the same type and therefore are more memory efficient. Tbis can be illustrated as follows: m = matrix(1:4, 2, 2) d = as.data.frame(m) object.size(m) # 216 bytes object.size(d) # 792 bytes↩︎